Abstract

We show how to compute lower bounds for the supremum Bayes error if the class-conditional distributions must 
satisfy moment constraints, where the supremum is with respect to the unknown class-conditional distributions. Our 
approach makes use of Curto and Fialkow's solutions for the truncated moment problem. The lower bound shows that 
the popular Gaussian assumption is not robust in this regard. We also construct an upper bound for the supremum 
Bayes error by constraining the decision boundary to be linear. 

Index Terms 

Bayes error, maximum entropy, moment constraint, truncated moments, quadratic discriminant analysis 



I. Introduction 

A standard approach in pattern recognition is to estimate the first two moments of each class-conditional 
distribution from training samples, and then assume the unknown distributions are Gaussians. Depending on the 
exact assumptions, this approach is called linear or quadratic discriminant analysis (QDA) [1], [2]. Gaussians are
known to maximize entropy given the first two moments [3] and to have other nice mathematical properties, but how
robust are they with respect to maximizing the Bayes error? To answer that, in this paper we investigate the more 
general question: "What is the maximum possible Bayes error given moment constraints on the class-conditional 
distributions?" 

We present both a lower bound and an upper bound for the maximum possible Bayes error. The lower bound 
means that there exists a set of class-conditional distributions that have the given moments and have a Bayes error 
above the given lower bound. The upper bound means that no set of class-conditional distributions can exist that 
have the given moments and have a higher Bayes error than the given upper bound. 

Our results provide some insight into how confident one can be in a classifier if one is confident in the estimation 
of the first n moments. In particular, given only the certainty that two equally-likely classes have different means 
(and no trustworthy estimate of their variances), we show that the Bayes error could be 1/2, that is, the classes 
may not be separable at all. Given the first two moments, our results show that the popular Gaussian assumption 
for the class distributions is fairly optimistic - the true Bayes error could be much worse. However, we show that 
the closer the class variances are, the more robust the Gaussian assumption is. In general, the given lower-bound 
may be a helpful way to assess the robustness of the assumed distributions used in generative classifiers. 

The given upper bound may also be useful in practice. Recall that the Bayes error is the error that would arise 
from the optimal decision boundary. Thus, if one has a classifier and finds that the sample test error is much higher 
than the given upper bound on the worst-case Bayes error, two possibilities should be considered. First, it may 
imply that the classifier's decision boundary is far from optimal, and that the classifier should be improved. Or, it 
could be that the test samples used to judge the test error are an unrepresentative set, and that more test samples 
should be taken to get a useful estimate of the test error. 

There are a number of other results regarding the optimization of different functionals given moment constraints 
(e.g. [4]-[12]). However, we are not aware of any previous work bounding the maximum Bayes error given moment
constraints. Some related problems are considered by Antos et al. [13]; a key difference from their work is that while
we assume moments are given, they instead take as given iid samples from the class-conditional distributions, and 
they then bound the average error of an estimate of the Bayes error. 



After some mathematical preliminaries, we give lower bounds for the maximum Bayes error in Section III. We
construct our lower bounds by creating a truncated moment problem. The existence of a particular lower bound
then depends on the feasibility of the corresponding truncated moment problem, which can be checked using Curto
and Fialkow's solutions [14] (reviewed in the appendix). In Section IV, we show that the approach of Lanckriet et
al. [10], which assumes a linear decision boundary, can be extended to provide an upper bound on the maximum
Bayes error. We provide an illustration of the tightness of these bounds in Section V, then end with a discussion
and some open questions.

II. Bayes Error 

Let $\mathcal{X}$ be a vector space and let $\mathcal{Y}$ be a finite set of classes. Without loss of generality we may assume that
$\mathcal{Y} = \{1, \ldots, G\}$. Suppose that there is a measurable classification function $h : \mathcal{X} \to S_G$ where $S_G$ is the $(G-1)$
probability simplex. Then the $i$th component of $h(x)$ can be interpreted as the probability of class $i$ given $x$, and
we write $p(i|x) = h(x)_i$.

For a given $x \in \mathcal{X}$, the Bayes classifier selects the class $y(x)$ that maximizes the posterior probability $p(i|x)$ (if
there is a tie for the maximum, then any of the tied classes can be chosen). The probability that the Bayes classifier
is wrong for a given $x$ is

$$P_e(x) = 1 - \max_{i \in \mathcal{Y}} p(i|x). \tag{II.1}$$

Suppose there is a probability measure $\nu$ defined on $\mathcal{X}$. Then the Bayes error is the expectation of $P_e$:

$$\mathbb{E}[P_e] = 1 - \int_{x \in \mathcal{X}} \max_{i \in \mathcal{Y}} p(i|x)\, d\nu(x). \tag{II.2}$$

The fact that the $p(i|x)$ must sum to one over $i$, and thus $\max_i p(i|x) \ge 1/G$, implies a trivial upper bound on the
Bayes error given in (II.2):

$$\mathbb{E}[P_e] \le \frac{G-1}{G}.$$

Suppose that the probability measure $\nu$ is defined on $\mathcal{X}$ such that it is absolutely continuous w.r.t. the Lebesgue
measure, so that it has density $p(x)$. Or suppose that it is discrete and expressed as

$$\nu = \sum_{j=1}^{\infty} \alpha_j \delta_{x_j},$$

where $\delta_{x_j}$ is the Dirac measure with support $x_j$, $\alpha_j > 0$ for all $j = 1, 2, \ldots$ and $\sum_{j=1}^{\infty} \alpha_j = 1$, and we say the
density $p(x_j) = \alpha_j$. In either case, (II.2) can be expressed in terms of the $i$th class prior $p(i) = \int p(i|x)\, d\nu(x)$ and
$i$th class-conditional density $p(x|i)$ (or probability mass function $p(x_j|i)$) as follows:

$$\mathbb{E}[P_e] = \begin{cases} 1 - \sum_{j=1}^{\infty} \max_{i \in \mathcal{Y}} p(x_j|i)\, p(i) & \text{in the discrete case} \\ 1 - \int \max_{i \in \mathcal{Y}} p(x|i)\, p(i)\, dx & \text{in the absolutely continuous case} \end{cases} \tag{II.3}$$
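As a minimal sketch of the discrete case of (II.3), the following Python function sums the largest joint mass at each atom. The function name `bayes_error_discrete`, the dictionary representation of the probability mass functions, and the example numbers are illustrative assumptions, not part of the original text.

```python
def bayes_error_discrete(priors, pmfs):
    """Return 1 - sum_j max_i p(x_j|i) p(i), the discrete case of (II.3).

    priors: class priors p(i), assumed to sum to one.
    pmfs:   one dict per class mapping atom x_j -> p(x_j|i);
            atoms missing from a dict carry zero mass for that class.
    """
    atoms = set()
    for pmf in pmfs:
        atoms.update(pmf)
    # Mass that the Bayes classifier labels correctly, atom by atom.
    correct = sum(
        max(prior * pmf.get(x, 0.0) for prior, pmf in zip(priors, pmfs))
        for x in atoms
    )
    return 1.0 - correct


# Hypothetical example: two equally likely classes that share an atom at 0.
priors = [0.5, 0.5]
pmfs = [{0.0: 0.4, -1.0: 0.6}, {0.0: 0.4, 2.0: 0.6}]
error = bayes_error_discrete(priors, pmfs)
```

In this example the only overlap is the shared atom of mass 0.4, so the error is 0.4 times 1/2, which stays below the trivial bound $(G-1)/G$.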

If $\nu$ is a general measure then Lebesgue's decomposition theorem says that it can be written as a sum of three
measures: $\nu = \nu_d + \nu_{ac} + \nu_{sc}$. Here $\nu_d$ is a discrete measure and the other two measures are continuous; $\nu_{ac}$ is
absolutely continuous w.r.t. the Lebesgue measure, and $\nu_{sc}$ is the remaining singular part. We have a convenient
representation for both the discrete and the absolutely continuous part of a measure but not for the singular portion.
For this reason we are going to restrict our attention to measures that are either discrete or absolutely continuous
(or a linear combination of these kinds of measures).

III. Lower Bounds for Worst-Case Bayes Error 

Our strategy for providing a lower bound on the supremum Bayes error is to constrain the $G$ probability
distributions $p(x|i)$, $i \in \mathcal{Y}$, to have an overlap of size $\epsilon \in (0,1)$. Specifically, we constrain the $G$ distributions
to each have a Dirac measure of size $\epsilon$ at the same location. In the case of uniform class prior probabilities this
makes the Bayes error at least $\epsilon \frac{G-1}{G}$. The largest such $\epsilon$ for which this overlap constraint is feasible determines the
best lower bound on the worst-case Bayes error this strategy can provide. The maximum such feasible $\epsilon$ can be
determined by checking whether there is a solution to a corresponding truncated moment problem (see the appendix
for details). Note that this approach does not restrict the distributions from overlapping elsewhere, which would
increase the Bayes error, and thus this approach only provides a lower bound to the maximum Bayes error.

We first present a constructive solution showing that no matter what the first moments are, the Bayes error can
be arbitrarily bad if only the first moments are given. Then we derive conditions for the size of the lower bound for 
the two moment case and three moment case, and end with what we can say for the general case of n moments. 

Lemma III.1. Suppose the first moments $\gamma_{1,i}$ are given for each $i$ in a subset of $\{1, \ldots, G\}$ and the remaining
class-conditional distributions are unconstrained. Then for all $1 > \epsilon > 0$ one can construct $G$ discrete or absolutely
continuous class-conditional distributions such that the Bayes error $\mathbb{E}[P_e] \ge (1 - \max_{i \in \mathcal{Y}} p(i))\,\epsilon$.

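Proof: This lemma works for any vector space $\mathcal{X}$. The moment constraints hold if the $i$th class-conditional
distribution is taken to be $p(x|i) = \epsilon\delta_0(x) + (1-\epsilon)\delta_{z_i}(x)$ where $z_i = \frac{\gamma_{1,i}}{1-\epsilon}$. This constructive solution exists
for any $\epsilon \in (0,1)$ and yields a Bayes error of at least $(1 - \max_{i \in \mathcal{Y}} p(i))\,\epsilon$. To see this, substitute the $G$ measures
$\{\epsilon\delta_0 + (1-\epsilon)\delta_{z_i}\}$ into (II.3) to produce

$$\begin{aligned}
\mathbb{E}[P_e] &= 1 - \sum_{j=1}^{\infty} \max_{i \in \mathcal{Y}} \left(\epsilon\delta_0(x_j) + (1-\epsilon)\delta_{z_i}(x_j)\right) p(i) \\
&\ge 1 - \sum_{j=1}^{\infty} \left( \max_{i \in \mathcal{Y}} p(i)\,\epsilon\,\delta_0(x_j) + \max_{i \in \mathcal{Y}} p(i)(1-\epsilon)\delta_{z_i}(x_j) \right) \\
&\ge 1 - \left( \epsilon \max_{i \in \mathcal{Y}} p(i) + (1-\epsilon) \cdot 1 \right) \\
&= \epsilon \left( 1 - \max_{i \in \mathcal{Y}} p(i) \right).
\end{aligned}$$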

For an absolutely continuous example, consider $\mathcal{X} = \mathbb{R}$. The uniform densities $p_l(x|i) =
\frac{1}{2lp(i)} I_{[\gamma_{1,i} - lp(i),\, \gamma_{1,i} + lp(i)]}(x)$ with $i = 1, 2, \ldots, G$ (where $I_E$ is the indicator function of the set $E$)
provide class-conditional distributions such that as $l \to \infty$ the Bayes error goes to $1 - \max_{i \in \mathcal{Y}} p(i)$. To see this, let
$i^* \in \arg\max_{i \in \mathcal{Y}} p(i)$ and consider the difference $d_l = \gamma_{1,i^*} + lp(i^*) - (\gamma_{1,i} + lp(i)) = (\gamma_{1,i^*} - \gamma_{1,i}) + l(p(i^*) - p(i))$.
If $p(i^*) - p(i) > 0$ then $d_l \to \infty$ as $l \to \infty$; therefore there is an $l' > 0$ such that if $l > l'$ then $d_l > 0$ and
hence $\gamma_{1,i^*} + lp(i^*) > \gamma_{1,i} + lp(i)$. A similar derivation shows that there is an $l'' > 0$ such that if $l > l''$ then
$\gamma_{1,i^*} - lp(i^*) < \gamma_{1,i} - lp(i)$. In other words, if $p(i^*) - p(i) > 0$ then $p(i^*)p(x|i^*)$ eventually dominates $p(i)p(x|i)$
since all the functions $p(j)p(x|j)$, $j = 1, \ldots, G$, have the same amplitude $\frac{1}{2l}$. If $p(i^*) - p(i) = 0$ then the integral of
the function $p(i)p(x|i)$ that is not dominated by $p(i^*)p(x|i^*)$ is $\frac{|\gamma_{1,i^*} - \gamma_{1,i}|}{2l} \to 0$ as $l \to \infty$. Finally, the integral of
the dominant function $p(i^*)p(x|i^*)$ is $\frac{1}{2l}\, 2lp(i^*) = p(i^*)$ and therefore the Bayes error approaches $1 - \max_{i \in \mathcal{Y}} p(i)$
as $l \to \infty$.

■ 
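The discrete construction in the proof can be checked numerically. The sketch below builds the two-atom distributions $\epsilon\delta_0 + (1-\epsilon)\delta_{z_i}$ and evaluates the discrete case of (II.3); the helper names, priors, and means are hypothetical values chosen only for illustration.

```python
def lemma_distributions(first_moments, eps):
    """Class i: mass eps at 0 and mass 1 - eps at z_i = gamma_{1,i} / (1 - eps)."""
    pmfs = []
    for g1 in first_moments:
        z = g1 / (1.0 - eps)
        pmf = {0.0: eps}
        # If g1 == 0 the second atom coincides with the atom at 0.
        pmf[z] = pmf.get(z, 0.0) + (1.0 - eps)
        pmfs.append(pmf)
    return pmfs


def bayes_error(priors, pmfs):
    """Discrete case of (II.3)."""
    atoms = set().union(*pmfs)
    return 1.0 - sum(
        max(p * pmf.get(x, 0.0) for p, pmf in zip(priors, pmfs)) for x in atoms
    )


priors = [0.3, 0.7]           # hypothetical class priors
first_moments = [1.0, 3.0]    # hypothetical means gamma_{1,i}
eps = 0.9
pmfs = lemma_distributions(first_moments, eps)
means = [sum(x * w for x, w in pmf.items()) for pmf in pmfs]
error = bayes_error(priors, pmfs)
bound = (1.0 - max(priors)) * eps
```

Here `means` reproduces the prescribed first moments, and `error` meets the bound $(1 - \max_i p(i))\,\epsilon$ of the lemma; letting `eps` approach 1 pushes the error toward $1 - \max_i p(i)$.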

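Theorem III.1. Suppose that $\mathcal{X} = \mathbb{R}$ and that there exist $G$ class-conditional measures with moments $\{\gamma_{1,i}, \gamma_{2,i}\}$,
$i \in \mathcal{Y}$. Given only this set of moments $\{\gamma_{1,i}, \gamma_{2,i}\}$, $i \in \mathcal{Y}$, a lower bound on the supremum Bayes error is

$$\begin{aligned}
\sup \mathbb{E}[P_e] &\ge \sup_{\Delta \in \mathbb{R}} \left[ \sum_{i=1}^{G} p(i)\left(1 - \frac{(\gamma_{1,i} - \Delta)^2}{\gamma_{2,i} + \Delta^2 - 2\Delta\gamma_{1,i}}\right) - \max_{i \in \mathcal{Y}} \left\{ p(i)\left(1 - \frac{(\gamma_{1,i} - \Delta)^2}{\gamma_{2,i} + \Delta^2 - 2\Delta\gamma_{1,i}}\right) \right\} \right] \\
&\ge (G-1) \sup_{\Delta \in \mathbb{R}} \min_{i \in \mathcal{Y}} \left\{ p(i)\left(1 - \frac{(\gamma_{1,i} - \Delta)^2}{\gamma_{2,i} + \Delta^2 - 2\Delta\gamma_{1,i}}\right) \right\},
\end{aligned}$$

where the supremum on the left-hand side is taken over all combinations of $G$ class-conditional measures satisfying
the moment constraints.

If $G = 2$ then

$$\sup \mathbb{E}[P_e] \ge \sup_{\Delta \in \mathbb{R}} \min_{i=1,2} \left\{ p(i)\left(1 - \frac{(\gamma_{1,i} - \Delta)^2}{\gamma_{2,i} + \Delta^2 - 2\Delta\gamma_{1,i}}\right) \right\}. \tag{III.1}$$

Further, if the class priors are equal, then, in terms of the centered second moment $\sigma_i^2 = \gamma_{2,i} - \gamma_{1,i}^2$, the optimal
$\Delta$ value is one of

$$\Delta = \frac{-(\gamma_{1,2}\sigma_1^2 - \gamma_{1,1}\sigma_2^2) \pm \sigma_1\sigma_2\,|\gamma_{1,1} - \gamma_{1,2}|}{\sigma_2^2 - \sigma_1^2} \tag{III.2}$$

if $\sigma_1 \ne \sigma_2$. Otherwise, if $\sigma_1 = \sigma_2$,

$$\Delta = \frac{\gamma_{1,1} + \gamma_{1,2}}{2} \tag{III.3}$$

and the lower bound simplifies:

$$\sup \mathbb{E}[P_e] \ge \frac{2\sigma_1^2}{4\sigma_1^2 + (\gamma_{1,1} - \gamma_{1,2})^2}.$$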



Proof: Consider some $0 < \epsilon < 1$. If the class prior is uniform then a sufficient condition for the Bayes error
to be at least $\frac{G-1}{G}\epsilon$ is if all of the unknown measures share a Dirac measure of at least $\epsilon$. First, we place this
Dirac measure at zero and find the maximum $\epsilon$ for which this can be done. Then later in the proof we show that a
larger $\epsilon$ (and hence a tighter lower bound on the maximum Bayes error) can be found by placing this shared Dirac
measure in a better location, or equivalently, by shifting all the measures.

Suppose a probability measure $\mu$ can be expressed in the form $\epsilon\delta_0 + \tilde{\mu}$ where $\tilde{\mu}$ is some measure such that
$\tilde{\mu}(\{0\}) = 0$. If $\mu$ satisfies the original moment constraints then $\tilde{\mu}$ also satisfies them; this follows directly from the
moment definition for $n \ge 1$:

$$\int x^n\, d\mu(x) = 0^n \epsilon + \int x^n\, d\tilde{\mu}(x) = \int x^n\, d\tilde{\mu}(x).$$

Also $\tilde{\mu}(\mathcal{X}) = 1 - \epsilon$. Thus, we require a measure $\tilde{\mu}$ with a zeroth moment $\gamma_0 = 1 - \epsilon > 0$ and the original first
and second moments $\gamma_1, \gamma_2$. Then, as described in Section VII, there are two conditions that we have to check. In
order to have a measure with the prescribed moments, the matrix

$$A = A(1) = \begin{bmatrix} 1 - \epsilon & \gamma_1 \\ \gamma_1 & \gamma_2 \end{bmatrix}$$

has to be positive semidefinite, which holds if and only if $\epsilon \le 1 - \frac{\gamma_1^2}{\gamma_2}$. (Note that the Theorem assumes that there
exists a distribution with the given moments, and thus the above implies that $\gamma_2 \ge \gamma_1^2$.) Moreover, the rank of
matrix $A$ and the rank of $\gamma$ (for notation see Section VII) have to be the same. Matrix $A$ can have rank 1 or 2. If
$\mathrm{rank}(A) = 1$ then the columns of $A$ are linearly dependent and therefore $\mathrm{rank}(\gamma) = 1$. If $\mathrm{rank}(A) = 2$ then $A$ is
invertible and $\mathrm{rank}(\gamma) = 2$. Thus there is a measure $\tilde{\mu}$ with moments $\{1 - \epsilon, \gamma_1, \gamma_2\}$ iff $0 < \epsilon \le 1 - \frac{\gamma_1^2}{\gamma_2}$. If such
a $\tilde{\mu}$ exists, then there also exists a discrete measure with moments $\{1 - \epsilon, \gamma_1, \gamma_2\}$ and $0 < \epsilon \le 1 - \frac{\gamma_1^2}{\gamma_2}$
by Curto and Fialkow's Theorem 3.1 and Theorem 3.9 [14].
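To make the feasibility condition concrete, the following sketch computes the largest shared mass $\epsilon$ allowed by the positive-semidefiniteness condition and builds one discrete remainder measure with the required moments. The two-atom construction (equal weights at mean plus or minus one standard deviation) is an illustrative choice, not the construction used by Curto and Fialkow, and the function names are hypothetical.

```python
import math


def max_shared_mass(g1, g2):
    """Largest eps for which [[1 - eps, g1], [g1, g2]] is positive semidefinite."""
    if g2 <= 0.0 or g2 < g1 * g1:
        raise ValueError("need g2 > 0 and g2 >= g1**2")
    return 1.0 - g1 * g1 / g2


def two_atom_remainder(g1, g2, eps):
    """Two equally weighted atoms whose moments of order 0, 1, 2 are 1 - eps, g1, g2."""
    mass = 1.0 - eps
    mean = g1 / mass
    var = g2 / mass - mean * mean
    if var < 0.0:
        raise ValueError("eps exceeds 1 - g1**2 / g2")
    sd = math.sqrt(var)
    return [(mean - sd, mass / 2.0), (mean + sd, mass / 2.0)]


g1, g2 = 1.0, 2.0                      # hypothetical first and second moments
eps_max = max_shared_mass(g1, g2)      # equals 1 - g1**2 / g2
atoms = two_atom_remainder(g1, g2, 0.3)
moments = [sum(w * x ** k for x, w in atoms) for k in range(3)]
```

One caveat of this sketch: if one of the two atoms happens to land exactly at zero, it merges with the shared Dirac mass, so a different remainder measure would be needed to keep $\tilde{\mu}(\{0\}) = 0$.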

Suppose we have $G$ such discrete probability measures satisfying the corresponding moment constraints given
in the statement of this theorem. Denote the $i$th discrete probability measure by $\nu_i = \epsilon_i\delta_0 + \sum_{j=1}^{\infty} \alpha_{j,i}\delta_{x_j}$ where
$0 < \epsilon_i \le 1 - \frac{\gamma_{1,i}^2}{\gamma_{2,i}}$ and $x_j \ne 0$ for all $j$, and $j$ indexes the set of all non-zero atoms in the $G$ discrete measures
$\{\nu_i\}$. Then the supremum Bayes error is bounded below by the Bayes error for this set of discrete measures:



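$$\begin{aligned}
\sup \mathbb{E}[P_e] &\ge 1 - \max_{i \in \mathcal{Y}}\{\epsilon_i p(i)\} - \sum_{j=1}^{\infty} \max_{i \in \mathcal{Y}} \alpha_{j,i}\, p(i) && \text{(III.4)} \\
&\ge 1 - \max_{i \in \mathcal{Y}}\{\epsilon_i p(i)\} - \sum_{j=1}^{\infty} \sum_{i=1}^{G} \alpha_{j,i}\, p(i) && \text{(III.5)} \\
&= 1 - \max_{i \in \mathcal{Y}}\{\epsilon_i p(i)\} - \sum_{i=1}^{G} p(i) \sum_{j=1}^{\infty} \alpha_{j,i} && \text{(III.6)} \\
&= 1 - \max_{i \in \mathcal{Y}}\{\epsilon_i p(i)\} - \sum_{i=1}^{G} p(i)(1 - \epsilon_i) \\
&= \sum_{i=1}^{G} p(i)\epsilon_i - \max_{i \in \mathcal{Y}}\{\epsilon_i p(i)\}. && \text{(III.7)}
\end{aligned}$$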



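This is true for any collection of $\epsilon_i$, $0 < \epsilon_i \le 1 - \frac{\gamma_{1,i}^2}{\gamma_{2,i}}$. This means that $\sup \mathbb{E}[P_e]$ is an upper bound for (III.7)
for these admissible $\epsilon_i$ and we can find a tighter inequality by finding the supremum of (III.7) over the set of
admissible $\epsilon_i$. The domain of the function (III.7) is the Cartesian product $\prod_{i=1}^{G} \left[0, 1 - \frac{\gamma_{1,i}^2}{\gamma_{2,i}}\right]$. It is a non-empty
compact set and (III.7) is continuous, so we can expect to find a maximum. The maximum is unique and to find it
let $(\epsilon_1, \ldots, \epsilon_G)$ be any element in the domain and let $i^* \in \arg\max_{i} \left\{ p(i)\left(1 - \frac{\gamma_{1,i}^2}{\gamma_{2,i}}\right) \right\}$. Since $p(i) \ge 0$, we have

$$\sum_{i=1}^{G} p(i)\epsilon_i - \max_{i \in \mathcal{Y}}\{p(i)\epsilon_i\} \le \sum_{i \ne i^*} p(i)\epsilon_i \le \sum_{i=1}^{G} p(i)\left(1 - \frac{\gamma_{1,i}^2}{\gamma_{2,i}}\right) - \max_{i \in \mathcal{Y}} \left\{ p(i)\left(1 - \frac{\gamma_{1,i}^2}{\gamma_{2,i}}\right) \right\}.$$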



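Therefore

$$\begin{aligned}
\sup \mathbb{E}[P_e] &\ge \sum_{i=1}^{G} p(i)\left(1 - \frac{\gamma_{1,i}^2}{\gamma_{2,i}}\right) - \max_{i \in \mathcal{Y}} \left\{ p(i)\left(1 - \frac{\gamma_{1,i}^2}{\gamma_{2,i}}\right) \right\} && \text{(III.8a)} \\
&\ge (G-1) \min_{i \in \mathcal{Y}} \left\{ p(i)\left(1 - \frac{\gamma_{1,i}^2}{\gamma_{2,i}}\right) \right\},
\end{aligned}$$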



where the supremum on the left-hand side is taken over all combinations of class-conditional measures that
satisfy the given moment constraints.
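The unshifted bound is a direct computation from the priors and the first two moments. The sketch below evaluates both the tighter expression (the sum of the per-class terms minus their maximum) and the looser $(G-1)\min$ form; the function name and the example moments are assumptions made for illustration.

```python
def unshifted_lower_bound(priors, first_moments, second_moments):
    """Return (tight, loose) lower bounds with the shared Dirac mass placed at 0.

    Each class contributes p(i) * (1 - gamma_{1,i}^2 / gamma_{2,i}).
    """
    terms = [
        p * (1.0 - g1 * g1 / g2)
        for p, g1, g2 in zip(priors, first_moments, second_moments)
    ]
    tight = sum(terms) - max(terms)
    loose = (len(terms) - 1) * min(terms)
    return tight, loose


# Hypothetical three-class example with equal priors.
priors = [1.0 / 3.0] * 3
first_moments = [1.0, 2.0, -1.0]
second_moments = [2.0, 5.0, 3.0]
tight, loose = unshifted_lower_bound(priors, first_moments, second_moments)
```

For these numbers the per-class terms are 1/6, 1/15 and 2/9, so the tight bound is 7/30 and the loose bound is 2/15.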

The next step follows from the fact that the Lebesgue measure and the counting measure are shift-invariant
measures and the Bayes error is computed by integrating some functions against those measures. Suppose we had
$G$ class distributions, and we shift each of them by $\Delta$. The Bayes error would not change. However, our lower
bound given in (III.8a) depends on the actual given means $\{\gamma_{1,i}\}$, and in some cases we can produce a better lower
bound by shifting the distributions before applying the above lower bounding strategy. The shifting approach we
present next is equivalent to placing the shared $\epsilon$ measure someplace other than at the origin.

Shifting a distribution by $\Delta$ does change all of the moments (because they are not centered moments). Specifically,
if $\mu$ is a probability measure with finite moments $\gamma_0 = 1, \gamma_1, \ldots, \gamma_n$, and $\mu_\Delta$ is the measure defined by $\mu_\Delta(D) =
\mu(D + \Delta)$ for all $\mu$-measurable sets $D$, then the $n$-th non-centered moment of the shifted measure $\mu_\Delta$ is

$$\tilde{\gamma}_n = \int x^n\, d\mu_\Delta(x) = \int (x - \Delta)^n\, d\mu(x) = \sum_{k=0}^{n} (-1)^{n-k} \binom{n}{k} \Delta^{n-k} \gamma_k,$$

where the second equality can easily be proven for any $\sigma$-finite measure using the definition of the integral. This same
formula shows that shifting back the measure will transform back the moments.

For the two-moment case, the shifted measure's moments are related to the original moments by:

$$\tilde{\gamma}_1 = \gamma_1 - \Delta,$$

$$\tilde{\gamma}_2 = \gamma_2 + \Delta^2 - 2\Delta\gamma_1.$$
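The binomial shift formula is easy to evaluate directly. The following sketch, with the hypothetical helper `shifted_moments`, maps a list of non-centered moments to the moments of the shifted measure and confirms that the two-moment case reduces to $\gamma_1 - \Delta$ and $\gamma_2 + \Delta^2 - 2\Delta\gamma_1$.

```python
from math import comb


def shifted_moments(moments, delta):
    """Moments of the measure shifted by delta.

    moments[k] is gamma_k for k = 0..n; the result uses
    sum_k (-1)^(n-k) * C(n, k) * delta^(n-k) * gamma_k for each order n.
    """
    return [
        sum(
            (-1) ** (n - k) * comb(n, k) * delta ** (n - k) * moments[k]
            for k in range(n + 1)
        )
        for n in range(len(moments))
    ]


original = [1.0, 2.0, 5.0]              # hypothetical gamma_0, gamma_1, gamma_2
shifted = shifted_moments(original, 0.5)
restored = shifted_moments(shifted, -0.5)
```

Shifting by $\Delta$ and then by $-\Delta$ recovers the original list, matching the remark that shifting back the measure transforms back the moments.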

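Then a tighter lower bound can be produced by choosing the shift $\Delta$ that maximizes the shift-dependent lower
bound given in (III.8a):

$$\begin{aligned}
\sup \mathbb{E}[P_e] &\ge \sup_{\Delta \in \mathbb{R}} \left[ \sum_{i=1}^{G} p(i)\left(1 - \frac{(\gamma_{1,i} - \Delta)^2}{\gamma_{2,i} + \Delta^2 - 2\Delta\gamma_{1,i}}\right) - \max_{i \in \mathcal{Y}} \left\{ p(i)\left(1 - \frac{(\gamma_{1,i} - \Delta)^2}{\gamma_{2,i} + \Delta^2 - 2\Delta\gamma_{1,i}}\right) \right\} \right] \\
&\ge (G-1) \sup_{\Delta \in \mathbb{R}} \min_{i \in \mathcal{Y}} \left\{ p(i)\left(1 - \frac{(\gamma_{1,i} - \Delta)^2}{\gamma_{2,i} + \Delta^2 - 2\Delta\gamma_{1,i}}\right) \right\}.
\end{aligned}$$

If $G = 2$ then this lower bound is

$$\sup_{\Delta \in \mathbb{R}} \left[ \sum_{i=1}^{2} p(i)\left(1 - \frac{(\gamma_{1,i} - \Delta)^2}{\gamma_{2,i} + \Delta^2 - 2\Delta\gamma_{1,i}}\right) - \max_{i \in \mathcal{Y}} \left\{ p(i)\left(1 - \frac{(\gamma_{1,i} - \Delta)^2}{\gamma_{2,i} + \Delta^2 - 2\Delta\gamma_{1,i}}\right) \right\} \right]
= \sup_{\Delta \in \mathbb{R}} \min_{i=1,2} \left\{ p(i)\left(1 - \frac{(\gamma_{1,i} - \Delta)^2}{\gamma_{2,i} + \Delta^2 - 2\Delta\gamma_{1,i}}\right) \right\}.$$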
We can say more in the case of equal class priors, that is, if $p(1) = p(2) = 1/2$. The functions

$$f_i(\Delta) = 1 - \frac{(\gamma_{1,i} - \Delta)^2}{\gamma_{2,i} + \Delta^2 - 2\Delta\gamma_{1,i}}$$

are maximized at $\Delta = \gamma_{1,i}$, where the maximum value is 1 and the derivative of function $f_i$ is strictly positive
for $\Delta < \gamma_{1,i}$ and strictly negative for $\Delta > \gamma_{1,i}$, $i = 1, 2$. This means that the potential maximum occurs at the
point where the two functions are equal. This results in a quadratic equation if $\sigma_1^2 \ne \sigma_2^2$ with solutions (III.2), and
otherwise a linear one with solution (III.3).

If $\gamma_{1,1} = \gamma_{1,2}$ then the function $f_i$ with smaller $\gamma_{2,i}$ will provide us with the lower bound which is 1/2, as
expected. If $\gamma_{1,1} \ne \gamma_{1,2}$ then since $f_i(\gamma_{1,i}) = 1$ the maximum occurs at a $\Delta$ value which is between the two $\gamma_{1,i}$.
To see this let $J$ be the interval defined by the two $\gamma_{1,i}$. As a consequence of the strict nature of the derivatives,
for any $\Delta$ value outside of the interval $J$ the function

$$\min_{i \in \mathcal{Y}} \left\{ 1 - \frac{(\gamma_{1,i} - \Delta)^2}{\gamma_{2,i} + \Delta^2 - 2\Delta\gamma_{1,i}} \right\}$$

is less than on $J$. But on $J$ the function $f_1(\Delta) - f_2(\Delta)$ is continuous and thanks to the fact that $f_i(\gamma_{1,i}) = 1$ and
the behavior of the derivatives, it has different sign at the two endpoints of $J$. This means that there is a $\Delta \in J$
such that $f_1(\Delta) - f_2(\Delta) = 0$. ■
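For two equally likely classes, the argument above reduces the supremum over shifts to evaluating the candidates in (III.2) or (III.3). The sketch below does this using the equivalent form $f_i(\Delta) = \sigma_i^2 / (\sigma_i^2 + (\Delta - \gamma_{1,i})^2)$ and compares the result with a brute-force grid search. The function names, the example moments, and the grid range are assumptions for illustration only.

```python
import math


def f(delta, mean, var):
    """f_i(Delta) = var / (var + (Delta - mean)^2), the per-class overlap term."""
    return var / (var + (delta - mean) ** 2)


def optimal_shift(m1, v1, m2, v2):
    """Pick the better candidate from (III.2), or use (III.3) when v1 == v2."""
    if math.isclose(v1, v2):
        return (m1 + m2) / 2.0
    root = math.sqrt(v1 * v2) * abs(m1 - m2)
    base = -(m2 * v1 - m1 * v2)
    candidates = [(base + root) / (v2 - v1), (base - root) / (v2 - v1)]
    return max(candidates, key=lambda d: min(f(d, m1, v1), f(d, m2, v2)))


def lower_bound_two_classes(m1, v1, m2, v2):
    """Lower bound on the worst-case Bayes error for G = 2 and priors 1/2."""
    d = optimal_shift(m1, v1, m2, v2)
    return 0.5 * min(f(d, m1, v1), f(d, m2, v2))


# Hypothetical moments: means 0 and 3, variances 1 and 4.
bound = lower_bound_two_classes(0.0, 1.0, 3.0, 4.0)
# Brute-force check of the supremum over a grid of shifts in [-10, 10].
grid_best = max(
    0.5 * min(f(k / 1000.0, 0.0, 1.0), f(k / 1000.0, 3.0, 4.0))
    for k in range(-10000, 10001)
)
equal_var_bound = lower_bound_two_classes(0.0, 1.0, 2.0, 1.0)
```

In the example the two candidates from (III.2) are $\Delta = 1$ and $\Delta = -3$; the first lies between the two means and gives the larger value, as the proof predicts.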

This theorem applies only to one-dimensional distributions. The approach of constraining the distributions to
have measure $\epsilon$ at a common location can be extended to higher dimensions, but actually determining whether the
moment constraints can still be satisfied becomes significantly more involved; see [14] for a sketch of the truncated moment
solutions for higher dimensions.

An argument similar to the one given in the last two paragraphs of the previous proof can be used to show that
if $\gamma_{2,i} - \gamma_{1,i}^2$ are all equal for all $i \in \mathcal{Y}$ and any finite $G$, and if the class priors are equal, then the optimal $\Delta$ is
$\frac{\gamma_{1,\min} + \gamma_{1,\max}}{2}$, where $\gamma_{1,\min}$ and $\gamma_{1,\max}$ are the smallest and the largest values in the set $\{\gamma_{1,i}\}_{i \in \mathcal{Y}}$, respectively. To
see this, we start with rewriting the function $f_i$:

$$f_i(\Delta) = 1 - \frac{(\gamma_{1,i} - \Delta)^2}{\gamma_{2,i} + \Delta^2 - 2\Delta\gamma_{1,i}} = \frac{\sigma_i^2}{\sigma_i^2 + (\Delta - \gamma_{1,i})^2}.$$

This shows that if the condition mentioned above holds then the functions $f_i$ are shifted versions of each other.
Let $f_{\min}$ and $f_{\max}$ be the functions corresponding to $\gamma_{1,\min}$ and $\gamma_{1,\max}$, respectively, and let $\Delta'$ be the point where
$f_{\min}$ and $f_{\max}$ intersect. Because of the strict nature of the derivatives of $f_i$ and because the functions are shifted
versions of each other, for any $\Delta > \Delta'$, $f_{\min}$ is smaller than any other $f_i$. Because of symmetry, it is true that for
any $\Delta < \Delta'$, $f_{\max}$ is smaller than any other $f_i$. Again, by symmetry we have that $\Delta' = \frac{\gamma_{1,\min} + \gamma_{1,\max}}{2}$; therefore
this is the optimal $\Delta$.

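Corollary III.1. Suppose that $\mathcal{X} = \mathbb{R}$ and that the first, the second and the third moments are given for $G$ class-
conditional measures, i.e. for the $i$th class-conditional measure we are given $\{\gamma_{1,i}, \gamma_{2,i}, \gamma_{3,i}\}$. Then the Bayes error
has lower bound

$$\begin{aligned}
\sup \mathbb{E}[P_e] &\ge \sup_{\delta > 0} \sup_{\Delta \in \mathbb{R}} \left[ \sum_{i=1}^{G} p(i)\left(1 - \frac{(\gamma_{1,i} - \Delta)^2}{\gamma_{2,i} + \Delta^2 - 2\Delta\gamma_{1,i}} - \delta\right) - \max_{i \in \mathcal{Y}} \left\{ p(i)\left(1 - \frac{(\gamma_{1,i} - \Delta)^2}{\gamma_{2,i} + \Delta^2 - 2\Delta\gamma_{1,i}} - \delta\right) \right\} \right] \\
&\ge (G-1) \sup_{\Delta \in \mathbb{R}} \min_{i \in \mathcal{Y}} \left\{ p(i)\left(1 - \frac{(\gamma_{1,i} - \Delta)^2}{\gamma_{2,i} + \Delta^2 - 2\Delta\gamma_{1,i}}\right) \right\}.
\end{aligned}$$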
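Proof: In this case we have a list of four numbers $\{1 - \epsilon, \gamma_1, \gamma_2, \gamma_3\}$ and again

$$A = A(1) = \begin{bmatrix} 1 - \epsilon & \gamma_1 \\ \gamma_1 & \gamma_2 \end{bmatrix}.$$

If $\gamma_0 = 1 - \epsilon > 0$ then $A$ is positive definite if $\epsilon \le 1 - \frac{\gamma_1^2}{\gamma_2} - \delta < 1 - \frac{\gamma_1^2}{\gamma_2}$. In this case $v_2 = (\gamma_2, \gamma_3)^T$ and it is
in the range of $A$ since $A$ is invertible. The statements in Section VII imply that there is a measure with moments
$\{1 - \epsilon, \gamma_1, \gamma_2, \gamma_3\}$ and consequently that

$$\begin{aligned}
\sup \mathbb{E}[P_e] &\ge \sup_{\delta > 0} \left[ \sum_{i=1}^{G} p(i)\left(1 - \frac{\gamma_{1,i}^2}{\gamma_{2,i}} - \delta\right) - \max_{i \in \mathcal{Y}} \left\{ p(i)\left(1 - \frac{\gamma_{1,i}^2}{\gamma_{2,i}} - \delta\right) \right\} \right] \\
&\ge (G-1) \min_{i \in \mathcal{Y}} \left\{ p(i)\left(1 - \frac{\gamma_{1,i}^2}{\gamma_{2,i}}\right) \right\}.
\end{aligned}$$

The rest of the proof follows analogously to the proof of Theorem III.1. ■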
The proof of Corollary III.1 relies on the fact that for $\delta > 0$ the matrix $A(1)$ featured in the proof is invertible,
so one of the conditions for the existence of a measure with the given moments is automatically satisfied (see
Appendix). If $\delta = 0$ then $A(1)$ is only positive semidefinite and it is not obvious that the vector $v_2$ is in the range
of $A(1)$.

The following lemma is stated for completeness. 

Lemma III.2. Suppose that $\mathcal{X} = \mathbb{R}$ and that the first $n$ moments are given for $G$ equally likely class-conditional
measures, i.e. for the $i$th class-conditional measure we are given $\{\gamma_{1,i}, \gamma_{2,i}, \ldots, \gamma_{n,i}\}$. Then if there exist measures
of the form $\epsilon_i\delta_0 + \nu_i$ where $\nu_i$ satisfies the moment conditions given above, the corresponding Bayes error can be
bounded from below:

$$\sup \mathbb{E}[P_e] \ge 1 - \frac{1}{G}\left( \max_{i \in \mathcal{Y}}\{\epsilon_i\} + \sum_{i=1}^{G} (1 - \epsilon_i) \right) = \frac{1}{G}\left( \sum_{i=1}^{G} \epsilon_i - \max_{i \in \mathcal{Y}}\{\epsilon_i\} \right),$$

where the supremum on the left is taken over all the measures satisfying the moment constraints noted above.



Proof: The first part of the proof of Theorem III.1 is applicable in this case. ■

As in the case for two moments, the lower bound can be further tightened by optimizing over all possible shifts
of the overlap Dirac measure.



IV. Upper Bound for Worst-Case Bayes Error 

Because the Bayes error is the smallest error over all decision boundaries, one approach to constructing an upper 
bound on the maximum Bayes error is to restrict the set of considered decision boundaries to a set for which the 
worst-case error is easier to analyze. For example, Lanckriet et al. [10] take as given the first and second moments
of each class-conditional distribution, and attempt to find the linear decision boundary classifier that minimizes the
worst-case classification error rate with respect to any choice of class-conditional distributions that satisfy the given
moment constraints. Here we show that this approach can be extended to produce an upper bound on the supremum
Bayes error for the $G = 2$ case.

Let $\mathcal{X}$ be any feature space. Suppose one has two fixed class-conditional measures $\nu_1, \nu_2$ on $\mathcal{X}$. As in Lanckriet et
al. [10], consider the set of linear decision boundaries. Any linear decision boundary splits the domain into two half-
spaces $S_1$ and $S_2$. We work with linear decision boundaries because these are the only kind of decision boundaries
that split the domain into two convex subsets. The error produced by a linear decision boundary corresponding to
the split $(S_1, S_2)$ is

$$p(1)\nu_1(S_2) + p(2)\nu_2(S_1) \ge \mathbb{E}[P_e](\nu_1, \nu_2). \tag{IV.1}$$

That is, the error from any linear decision boundary upper bounds the Bayes error $\mathbb{E}[P_e]$ for two given measures.
To obtain a tighter upper bound on the Bayes error, minimize the left-hand side over all linear decision boundaries:

$$\inf_{S_1, S_2} \left( p(1)\nu_1(S_2) + p(2)\nu_2(S_1) \right) \ge \mathbb{E}[P_e](\nu_1, \nu_2). \tag{IV.2}$$



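Now suppose $\nu_1$ and $\nu_2$ are unknown, but their first moments (means) and second centered moments $(\mu_1, \Sigma_1)$
and $(\mu_2, \Sigma_2)$ are given. Then we note that the supremum over all measures $\nu_1$ and $\nu_2$ with those moments of the
smallest linear decision boundary error forms an upper bound on the supremum Bayes error, where the supremum
is taken with respect to the feasible measures $\nu_1, \nu_2$:

$$\begin{aligned}
\sup \mathbb{E}[P_e] &\le \sup_{\nu_1 | \mu_1, \Sigma_1}\ \sup_{\nu_2 | \mu_2, \Sigma_2}\ \inf_{S_1, S_2} \left( p(1)\nu_1(S_2) + p(2)\nu_2(S_1) \right) && \text{(IV.3)} \\
&\le \inf_{S_1, S_2} \left( p(1) \sup_{\nu_1 | \mu_1, \Sigma_1} \nu_1(S_2) + p(2) \sup_{\nu_2 | \mu_2, \Sigma_2} \nu_2(S_1) \right). && \text{(IV.4)}
\end{aligned}$$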
This upper bound can be simplified using the following result$^1$ by Bertsimas and Popescu [15] (which follows
from a result by Marshall and Olkin [16]):

$$\sup_{\nu | \mu, \Sigma} \nu(S) = \frac{1}{1 + c(S)} \quad \text{where} \quad c(S) = \inf_{x \in S} (x - \mu)^T \Sigma^{-1} (x - \mu), \tag{IV.5}$$

where the sup in (IV.5) is over all probability measures $\nu$ with domain $\mathcal{X}$, mean $\mu$, and centered second moment
$\Sigma$; and $S$ is some convex set in the domain of $\nu$.
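For a half-space, the infimum $c(S)$ in (IV.5) has a simple closed form. The sketch below assumes the half-space is written as $\{x : a^T x \ge b\}$ with $a \ne 0$ and that $\Sigma$ is positive definite, in which case $c(S) = \max(0, b - a^T\mu)^2 / (a^T \Sigma a)$; this parameterization and the function name are assumptions introduced here, not notation from the text.

```python
def worst_case_halfspace_mass(a, b, mean, cov):
    """sup of nu({x : a.x >= b}) over measures with the given mean and covariance.

    Uses (IV.5) with c(S) = max(0, b - a.mean)^2 / (a^T cov a), which is the
    squared Mahalanobis distance from the mean to the half-space.
    """
    dim = len(a)
    a_dot_mean = sum(a[i] * mean[i] for i in range(dim))
    a_cov_a = sum(a[i] * cov[i][j] * a[j] for i in range(dim) for j in range(dim))
    gap = max(0.0, b - a_dot_mean)
    c = gap * gap / a_cov_a
    return 1.0 / (1.0 + c)


# One-dimensional check: mean 0, variance 1, half-line [2, infinity).
mass_1d = worst_case_halfspace_mass([1.0], 2.0, [0.0], [[1.0]])
# Two-dimensional hypothetical example with identity covariance.
mass_2d = worst_case_halfspace_mass(
    [1.0, 1.0], 2.0, [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
)
# A half-space that contains the mean can carry all of the mass.
mass_full = worst_case_halfspace_mass([1.0], -1.0, [0.0], [[1.0]])
```

In one dimension this reduces to $\sigma^2 / (\sigma^2 + t^2)$ for a threshold at distance $t$ from the mean, which is the one-sided Chebyshev (Cantelli) bound.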



Since $S_1$ and $S_2$ in (IV.4) are half-spaces, they are convex and (IV.5) can be used to quantify the upper bound.
For the rest of this section let $\mathcal{X}$ be one dimensional. Then the covariance matrices $\Sigma_1$ and $\Sigma_2$ are just scalars that
we denote by $\sigma_1^2$ and $\sigma_2^2$, respectively. In one dimension, any decision boundary that results in a half-plane split is
simply a point $s \in \mathbb{R}$. Without loss of generality with respect to the Bayes error, let $\mu_1 = 0$ and $\mu_1 < \mu_2$. Then
$c(S_1)$ and $c(S_2)$ in (IV.5) simplify (for details, see Appendix A of [10]), so that (IV.4) becomes

$$\sup \mathbb{E}[P_e] \le \min\left\{ \inf_{s} \left( p(1)\frac{\sigma_1^2}{\sigma_1^2 + s^2} + p(2)\frac{\sigma_2^2}{\sigma_2^2 + (\mu_2 - s)^2} \right),\ \frac{1}{2} \right\}. \tag{IV.6}$$

If $\sigma_1 = \sigma_2$ and $p(1) = p(2) = 1/2$, then the infimum occurs at $s = \mu_2/2$ and the upper bound becomes

$$\sup \mathbb{E}[P_e] \le \min\left\{ \frac{4\sigma_1^2}{4\sigma_1^2 + \mu_2^2},\ \frac{1}{2} \right\}.$$

For this case, the given upper bound is twice the given lower bound.
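When the variances differ, the infimum in (IV.6) can be approximated numerically. The sketch below uses a plain grid search and assumes the threshold $s$ is restricted to the interval $[0, \mu_2]$ between the two means, where the simplified one-dimensional expressions for $c(S_1)$ and $c(S_2)$ apply; that restriction, the grid resolution, and the function name are assumptions of this example.

```python
def linear_upper_bound(mu2, var1, var2, p1=0.5, steps=10000):
    """Grid approximation of the upper bound (IV.6) with mu_1 = 0 < mu2.

    The threshold s is searched over [0, mu2] only (an assumption of this sketch).
    """
    p2 = 1.0 - p1
    best = min(
        p1 * var1 / (var1 + s * s) + p2 * var2 / (var2 + (mu2 - s) ** 2)
        for s in (mu2 * t / steps for t in range(steps + 1))
    )
    return min(best, 0.5)


# Equal variances and priors: compare with the closed form 4 / (4 + mu2^2).
mu2 = 4.0
upper = linear_upper_bound(mu2, 1.0, 1.0)
closed_form = min(4.0 / (4.0 + mu2 ** 2), 0.5)
lower = 2.0 / (4.0 + mu2 ** 2)   # equal-variance lower bound from Theorem III.1
```

For $\mu_2 = 4$ and unit variances the grid minimum is attained at $s = 2$, and the resulting upper bound is twice the corresponding lower bound.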

V. Comparison to Error With Gaussians 

We illustrate the bounds described in this paper for the common case that the first two moments are known for 
each class, and the classes are equally likely. We compare with the Bayes error produced under the assumption 
that the distributions are Gaussians with the given moments. In both cases the first distribution's mean is 0 and
the variance is 1, and the second distribution's mean is varied from 0 to 25 as shown on the x-axis. The second
distribution's variance is 1 for the comparison shown in the top of Fig. 1. The second distribution's variance is 5
for the comparison shown in the bottom of Fig. 1. For the first case, $\sigma_1 = \sigma_2$ so the infimum in (IV.6) occurs at
$s = \mu_2/2$ and the upper bound is $\min\{4/(4 + \mu_2^2),\ 1/2\}$. For the second case with different variances we compute
(IV.6) numerically.

Fig. 1 shows that the Bayes error produced by the Gaussian assumption is optimistic compared to the given
lower bound for the worst-case (maximum) Bayes error. Further, the difference between the Gaussian Bayes error
and the lower bound is much larger in the second case when the variances of the two distributions differ.
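The equal-variance comparison can be reproduced approximately with a few lines of code. The sketch below assumes unit variances and equal priors, in which case the Gaussian Bayes error is $\Phi(-\mu_2/2)$, and tabulates it against the closed-form lower and upper bounds; the sampled values of $\mu_2$ are arbitrary illustrative choices rather than the points plotted in the figure.

```python
import math


def gaussian_bayes_error(mu2, sigma=1.0):
    """Bayes error for N(0, sigma^2) vs N(mu2, sigma^2) with equal priors."""
    return 0.5 * math.erfc(mu2 / (2.0 * sigma * math.sqrt(2.0)))


def lower_bound(mu2, sigma=1.0):
    """Equal-variance lower bound 2 sigma^2 / (4 sigma^2 + mu2^2)."""
    return 2.0 * sigma ** 2 / (4.0 * sigma ** 2 + mu2 ** 2)


def upper_bound(mu2, sigma=1.0):
    """Equal-variance upper bound min{4 sigma^2 / (4 sigma^2 + mu2^2), 1/2}."""
    return min(4.0 * sigma ** 2 / (4.0 * sigma ** 2 + mu2 ** 2), 0.5)


rows = [
    (mu2, gaussian_bayes_error(mu2), lower_bound(mu2), upper_bound(mu2))
    for mu2 in (0.0, 1.0, 2.0, 5.0, 10.0, 25.0)
]
for mu2, gauss, low, up in rows:
    print(f"mu2={mu2:5.1f}  gaussian={gauss:.4g}  lower={low:.4g}  upper={up:.4g}")
```

At the sampled means the Gaussian error does not exceed the lower bound on the worst case, which is the sense in which the Gaussian assumption is optimistic.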

VI. Discussion and Some Open Questions 

We have provided a lower and upper bound on the worst-case Bayes error, but a number of open questions arise 
from this work. 

Lower bounds for the worst-case Bayes error can be constructed by constraining the distributions. We have shown 
that constraining the distributions to be Gaussians produces a weak lower bound, and we provided a tighter lower 
bound by constraining the distributions to overlap in a Dirac measure of size $\epsilon$. Given only first moments, our lower
bound is tight in that it is arbitrarily close to the worst possible Bayes error. Given two moments, we have shown
that the common QDA Gaussian assumption for class-conditional distributions is much more optimistic than our 
lower bound and increasingly optimistic for increased difference between the variances. However, because in our 
constructions we do not control all the possible overlap between the class-conditional distributions, we believe it 
should be possible to construct tighter lower bounds. 

On the other hand, upper bounds on the worst-case Bayes error can be constructed by constraining the considered 
decision boundaries. Here, we considered an upper bound resulting from restricting the decision boundary to be
linear. For the two moment case, we have shown that work by Lanckriet et al. leads almost directly to an upper
bound. However, the inequality we had to introduce in (IV.4) when we switched the inf and sup may make this
upper bound loose. It remains an open question if there are conditions under which the upper bound is tight.



$^1$ Some readers may recognize this result as a strengthened and generalized version of the Chebyshev-Cantelli inequality.

Fig. 1. Comparison of the given lower bound for the worst-case Bayes error with the Bayes error produced by Gaussian class-conditional
distributions.

Our result that the popular Gaussian assumption is generally not very robust in terms of worst-case Bayes error 
prompts us to question whether there are other distributions that are mathematically or computationally convenient 
to use in generative classifiers that would have a Bayes error closer to the given lower bound. 

In practice, a moment constraint is often created by estimating the moment from samples drawn iid from the 
distribution. In that case, the moment constraint need not be treated as a hard constraint as we have done here. 
Rather, the observed samples can imply a probability distribution over the moments, which in turn could imply a 
distribution over corresponding bounds on the Bayes error. A similar open question is a sensitivity analysis of how 
changes in the moments would affect the bounds. 

Lastly, consider the opposite problem: given constraints on the first n moments for each of the class-conditional 
distributions, how small could the Bayes error be? It is tempting to suppose that one could generally find discrete 
measures that overlapped nowhere, such that the Bayes error was zero. However, the set of measures which satisfy 
a set of moment constraints may be nowhere dense, and that impedes us from being able to make such a guarantee.
Thus, this remains an open question. 
Appendix

VII. Existence of Measures with Certain Moments 

The proof of our theorem reduces to the problem of how to check if a given list of $n$ numbers could be the
moments of some measure. This problem is called the truncated moment problem; here we review the relevant
solutions by Curto and Fialkow [14].

Suppose we are given a list of numbers $\gamma = \{\gamma_0, \gamma_1, \ldots, \gamma_n\}$, with $\gamma_0 > 0$. Can this collection be a list of
moments for some positive Borel measure $\nu$ on $\mathbb{R}$ such that

$$\gamma_i = \int s^i\, d\nu(s)? \tag{VII.1}$$

Let $k = \lfloor n/2 \rfloor$, and construct a Hankel matrix $A(k)$ from $\gamma$ where the $i$th row of $A$ is $[\gamma_{i-1}\ \gamma_i\ \ldots\ \gamma_{i-1+k}]$. For
example, for $n = 2$ or $n = 3$, $k = 1$:

$$A(1) = \begin{bmatrix} \gamma_0 & \gamma_1 \\ \gamma_1 & \gamma_2 \end{bmatrix}.$$

Let $v_j$ be the transpose of the vector $(\gamma_{i+j})_{i=0}^{k}$. For $0 \le j \le k$ this vector is the $(j+1)$th column of $A(k)$. Define
$\mathrm{rank}(\gamma) = k + 1$ if $A(k)$ is invertible, and otherwise $\mathrm{rank}(\gamma)$ is the smallest $r$ such that $v_r$ is a linear combination
of $\{v_0, \ldots, v_{r-1}\}$.

Then whether there exists a $\nu$ that satisfies (VII.1) depends on $n$ and $k$:

1) If $n = 2k + 1$, then there exists such a solution $\nu$ if $A(k)$ is positive semidefinite and $v_{k+1}$ is in the range of
$A(k)$.

2) If $n = 2k$, then there exists such a solution $\nu$ if $A(k)$ is positive semidefinite and $\mathrm{rank}(\gamma) = \mathrm{rank}(A(k))$.

Also, if there exists a $\nu$ that satisfies (VII.1), then there definitely exists a solution with atomic measure.
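For the cases used in this paper ($n = 2$ and $n = 3$, so $k = 1$), the two conditions above can be checked by hand on the $2 \times 2$ Hankel matrix. The following sketch does so; the function name and the absolute tolerance `tol` are assumptions of this example, and the general case would need proper rank and range computations.

```python
def feasible_truncated_moments(gamma, tol=1e-12):
    """Check the existence conditions above for n = 2 or n = 3 (so k = 1).

    gamma = [g0, g1, g2] or [g0, g1, g2, g3] with g0 > 0.
    """
    if len(gamma) not in (3, 4) or gamma[0] <= 0.0:
        raise ValueError("expected 3 or 4 moments with gamma_0 > 0")
    g0, g1, g2 = gamma[:3]
    det = g0 * g2 - g1 * g1
    # A(1) is positive semidefinite iff its diagonal and determinant are >= 0.
    if g2 < -tol or det < -tol:
        return False
    if len(gamma) == 3:
        # n = 2k: with g0 > 0, a singular A(1) has rank 1 and v_1 is a multiple
        # of v_0, so rank(gamma) = rank(A(1)) holds automatically.
        return True
    g3 = gamma[3]
    if det > tol:
        return True  # A(1) invertible, so v_2 = (g2, g3) is in its range
    # Singular A(1): its range is spanned by v_0 = (g0, g1).
    return abs(g0 * g3 - g1 * g2) <= tol


on_boundary = feasible_truncated_moments([0.5, 1.0, 2.0])      # eps = 1 - g1^2/g2
too_much_mass = feasible_truncated_moments([0.4, 1.0, 2.0])    # eps too large
dirac_at_one = feasible_truncated_moments([1.0, 1.0, 1.0, 1.0])
inconsistent = feasible_truncated_moments([1.0, 1.0, 1.0, 2.0])
```

The first two calls mirror the proof of Theorem III.1 with $\gamma_1 = 1$, $\gamma_2 = 2$: a remainder of mass $1 - \epsilon = 0.5$ is feasible, while $1 - \epsilon = 0.4$ is not.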